GeoCorpora: building a corpus to test and train microblog geoparsers

Authors

  • Jan Oliver Wallgrün
  • Morteza Karimzadeh
  • Alan M. MacEachren
  • Scott Pezanowski
Abstract

In this article, we present the GeoCorpora corpus building framework and software tools as well as a geo-annotated Twitter corpus built with these tools to foster research and development in the areas of microblog/Twitter geoparsing and geographic information retrieval. The developed framework employs crowdsourcing and geovisual analytics to support the construction of large corpora of text in which the mentioned location entities are identified and geolocated to toponyms in existing geographical gazetteers. We describe how the approach has been applied to build a corpus of geo-annotated tweets that will be made freely available to the research community alongside this article to support the evaluation, comparison, and training of geoparsers. Additionally, we report lessons learned related to corpus construction for geoparsing as well as insights about the notions of place and natural spatial language that we derive from application of the framework to building this corpus.

Similar resources

Improving Microblog Retrieval from Exterior Corpus by Automatically Constructing Microblogging Corpus

A large-scale training corpus consisting of microblogs belonging to a desired category is important for high-accuracy microblog retrieval. Obtaining such a large-scale microblogging corpus manually is very time- and labor-consuming. Therefore, some models for the automatic retrieval of microblogs from an exterior corpus have been proposed. However, these approaches may fail in considering microblog...

A'laam corpus: a standard named-entity corpus for the Persian language

Named entity recognition (NER) is a natural language processing (NLP) task that is mainly used in text summarization, data mining, data retrieval, question answering, machine translation, and document classification systems. A NER system is tasked with determining the boundaries of each named entity, recognizing its type, and classifying it into predefined categories. The categories of named...

Unsupervised WSD based on Automatically Retrieved Examples: The Importance of Bias

This paper explores the large-scale acquisition of sense-tagged examples for Word Sense Disambiguation (WSD). We have applied the “WordNet monosemous relatives” method to automatically construct a web corpus that we have used to train disambiguation systems. The corpus-building process has highlighted important factors, such as the distribution of senses (bias). The corpus has been used to trai...

Towards Scalable Emotion Classification in Microblog Based on Noisy Training Data

The availability of a labeled corpus is of great importance for emotion classification tasks. Because manual labeling is too time-consuming, hashtags have been used as naturally annotated labels to obtain large amounts of labeled training data from microblogs. However, the inconsistency and noise in annotation can adversely affect the data quality and thus the performance when used to train a classi...

Journal:
  • International Journal of Geographical Information Science

Volume: 32

Publication date: 2018